require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(related)
require(adegenet)
require(knitr)

Readme

This document is an R notebook. If you’d like to view pre-rendered figures, read a summary of the analysis, and interact with the code, please open the relevant html file in a browser.

To conduct a similar analysis on your computer, or to edit or run the code: clone this repository into a directory on your local machine and open the .Rproj file in Rstudio.

Rationale

Genotyping-in-Thousands by sequencing (GT-seq) is a cost-effective method developed by Campbell et al. (2015) to genotype up to thousands of samples at a small (~500) set of markers using the Illumina platform.

This document:
(a) outlines the bioinformatic procedures for processing raw GT-seq reads to a fully filtered SNP dataset according to the standards adopted by the State Fisheries Genomics Lab at Oregon State University
(b) shares instructions for setting up the environment to run the scripts
(c) provides a detailed example protocol for genotyping Oncorhynchus mykiss samples, with figures and results
(d) directs GT-seq users to the relevant metadata needed to conduct their own genotyping (e.g., probe sequences)

GTseq outline

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      tab6 [label = '@@6']
      tab7 [label = '@@7']
      tab8 [label = '@@8']
      # edge definitions with the node IDs
      tab1 -> tab2 [label = 'GTseq_BarcodeSplit']
      tab2 -> tab3 [label = 'GTseq_Genotyper']
      tab3 -> tab4 [label = 'GTseq_genocompile']
      tab4 -> tab5 [label = 'filtering']
      tab4 -> tab6 [label = 'GTseq_summaryfigs']
      tab3 -> tab7 [label = 'GTseq_genocompilecounts']
      tab7 -> tab5 
      tab1 -> tab8 [label = 'GTseq_seqtest']
      tab7 -> tab4 [label = 'GTseq_Omysex']
      tab6 -> tab5
      }
      [1]: 'Raw Reads'
      [2]: 'Demultiplexed fastqs'
      [3]: 'Individual genotypes'
      [4]: 'raw GTseq dataset'
      [5]: 'final filtered dataset'
      [6]: 'summary figures'
      [7]: 'read counts'
      [8]: 'marker info'
      
      ")

Full GT-seq genotyping flowchart: nodes are data, edges are scripts

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      # edge definitions with the node IDs
      tab1 -> tab2 [label = 'GTseq_BarcodeSplit']
      tab2 -> tab3 [label = 'GTseq_Genotyper']
      tab3 -> tab4 [label = 'GTseq_genocompile']
      tab4 -> tab5 [label = 'filtering']
      }
      [1]: 'Raw Reads'
      [2]: 'Demultiplexed fastqs'
      [3]: 'Individual genotypes'
      [4]: 'raw GTseq dataset'
      [5]: 'final filtered dataset'
      
      ")

Typical GT-seq genotyping flowchart when using an established panel: nodes are data, edges are scripts

Overview

The GT-seq pipeline consists of a handful of Perl and Python scripts that process raw sequencing reads into a fully filtered SNP dataset. Here’s a quick reference for each step in a typical pipeline.

Sequencing QC
- Check that the reads from the sequencer look good with fastqc or a similar program

GT-seq Scripts
- Demultiplex reads: The first step is to demultiplex the sequencing data (if demuxed files are not provided by the sequencing center) using the GTseq_Barcode_Split_MP.py script
- Call genotypes: Use GTseq_Genotyper_v3.1.pl to call genotypes on demultiplexed reads.
- (Optional) Call sex genotypes: If the species has a dedicated script for calling sex markers (O. mykiss and O. tshawytscha do), run it as well
- Compile: After all the individual genotypes are called, compile them into a single output using the GTseq_GenoCompile_v3.pl script.
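The four script steps above can be sketched as a dry-run shell wrapper. This is a hedged sketch, not the lab's own wrapper: the script names match the pipeline, but every path, panel file, and sample name below is a placeholder, and the commands are written to a review file rather than executed, so nothing runs until you substitute real paths.

```shell
# Dry-run sketch of the four GT-seq steps; every path below is a placeholder.
PIPELINE=/path/to/GTseq-Pipeline        # placeholder: directory holding the scripts
PROBES=/path/to/ProbeSeqs.csv           # placeholder: your panel's probe file
RAW=/path/to/raw_reads.fq               # placeholder: multiplexed raw reads
KEY=/path/to/index_key.csv              # placeholder: barcode key file

CMDS=cmds_to_review.txt                 # commands are collected here, not executed
: > "$CMDS"

# 1. demultiplex (skip if the sequencing center already demuxed)
echo "python $PIPELINE/GTseq_BarcodeSplit_MP.py $KEY $RAW" >> "$CMDS"

# 2. call genotypes: one .genos per demultiplexed fastq
for fq in sample1.fastq sample2.fastq; do    # placeholder sample names
  echo "perl $PIPELINE/GTseq_Genotyper_v3.1.pl $PROBES $fq > ${fq%.fastq}.genos" >> "$CMDS"
done

# 3. (optional) species-specific sex genotyper
echo "perl $PIPELINE/OmySEX_test_v3.pl" >> "$CMDS"

# 4. compile all .genos into a single csv
echo "perl $PIPELINE/GTseq_GenoCompile_v3.pl > genotypes.csv" >> "$CMDS"

cat "$CMDS"   # review the commands before running any of them
```

Once the paths are real, run each line from the review file in order (on real data, the genotyper step is submitted as an array job; see the example script later in this SOP).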

QAQC Check:
- Check positive and negative controls (for plate flipping, other library prep errors)
- Check known-genotype controls if included (heterozygous, winter, and summer controls)
- Check technical replicates
- Remove controls and replicates if all looks good

Filtering:
- IFI_cutoff = 2.5 (remove individuals with low confidence)
- GTperc_cutoff=90 (exclude individuals with greater than 10% missing data)
- Missingness (loci) > 20% (remove loci with >20% missing data)
- Missingness (loci) > 10% (examine moderately bad loci for incorrect allele correction issues and attempt to rescue)
- Remove monomorphic SNPs
- Remove duplicated individuals

Environment setup and how to use

This SOP is run in two environments: the OSU cluster and a local instance of R. You can open this document as a .rmd in Rstudio when processing your own data, or simply copy and paste the code found here into your own terminal.

A few optional code chunks also use the terminal on a unix-based machine (i.e., a Mac), but they can also run entirely on the cluster, or on a Windows machine with a little work (just run the unix commands in a unix/bash terminal on Windows, such as Cygwin or the bash shell available on Windows 10).

Commands sent to the terminal appear as code chunks in this SOP, and narrative is provided in the main body. Click the gray “code” box to see the commands and more details about a step. The appropriate environment for running each code chunk appears as a comment at the head of the chunk. There are also full scripts submitted as jobs to the server; these need to be saved as files and submitted with the appropriate command for the server scheduler, such as qsub or SGE_Batch.

Copying the directory in this repository named “example_data” will provide all the relevant files to run the local (R) portion of the example protocol. See the code chunk below for how to do this.

# Local

# note: this is optional and only needed if you want to follow along with the example code on your own machine

# move to your working directory 
# create a directory for this repository and move inside
mkdir GT-seq
cd GT-seq

#copy the whole repository using git clone
git clone https://github.com/State-Fisheries-Genomics-Lab/GT-seq.git

# after you've cloned the repository, open the file GT-seq.Rproj in Rstudio for all features

Finally, paths to files and scripts in this SOP point to David Dayan’s directories; you will need to modify the scripts to reflect your own paths.

OSU Cluster Setup

Scripts
In order to run the GT-seq pipeline, you must have a copy of the most current GT-seq scripts stored on the cluster. As there are multiple versions of the scripts, it is recommended to download them into a directory on the cluster directly from the GitHub repository, which is actively maintained.

Option A: File transfer GUI

Move all the files from the script directory to your own directory on the server using filezilla or whatever file transfer software you like.

Option B: Git

You can also clone the whole repository onto the server with git and then move the relevant files to wherever you want them.


# SERVER

# clone the repository somewhere temporary
mkdir ~/tmp
cd ~/tmp
git clone https://github.com/State-Fisheries-Genomics-Lab/GT-seq


# choose home directory for your software
# make directory for GTseq scripts and move files inside
mkdir /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/
mv  ~/tmp/GT-seq/GT-seq_scripts/* /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/

#delete your temporary repo
bash #default shell is tcsh which requires you to confirm EVERY SINGLE file deletion
rm -r ~/tmp/* #caution scary, be sure there's no typo here
rmdir ~/tmp

Perl and Python Libraries

The GT-seq pipeline uses both Perl and Python. All of the relevant libraries should already be installed on the server except String::Approx. You will need to install String::Approx and configure your server environment so it is included in your path.

Installing:


# SERVER

# You have three options to install String::Approx, I confirmed the first one works, but the others may be even simpler

#### Option 1: make

# first download the tarball for String::Approx here ( https://metacpan.org/release/String-Approx ) into a temporary directory on the server

# then, unpack the tarball and follow the instructions in the readme (below)
tar -xzf String-Approx-*.tar.gz
cd String-Approx-*/
perl Makefile.PL
make
make test
make install

#### Option 2: Conda
conda install -c bioconda perl-string-approx

#### Option 3: cpan
perl -MCPAN -e shell
install String::Approx

Setting the environment:

Before running any scripts that require the String::Approx module, you must set the environment with the following commands. This will be noted in the example scripts but can also be found here.


#SERVER

setenv PERL5LIB ~/perl5/lib/perl5/x86_64-linux-thread-multi/ #for tcsh (default shell on server)

export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/' # if using bash as your shell
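Before submitting jobs, it's worth checking that Perl can actually see the module on the path you just exported. A quick sanity-check sketch (the path here is an example from this SOP; substitute your own):

```shell
# verify String::Approx is reachable via PERL5LIB (example path; use your own)
export PERL5LIB="$HOME/perl5/lib/perl5/x86_64-linux-thread-multi/"
msg=$(perl -MString::Approx -e 'print "String::Approx loaded"' 2>/dev/null \
      || echo "String::Approx NOT found -- check PERL5LIB")
echo "$msg"
```

If the check reports the module missing, revisit the install step above before running any genotyper jobs.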

Panel Info

This repository contains all the files needed to run the pipeline on new data and additional information about the panel markers. See readmes in each subdirectory for up to date contents.

Probe_Sequences
The probe sequences used by the genotyper script to call SNPs at the panel markers. Link here

Panel Info
Information about the panel markers such as amplicon sequences, annotations, sources, primer and probe sequences etc. Link here

Marker Mapping
Results from mapping studies. Link here

Example Script

Here is an example script of a full genotyping pipeline from raw reads to fully filtered SNP dataset.

Data Summary

The example dataset uses fall run and half-pounder steelhead from the Rogue River.

Sample Summary

Let’s gather all the metadata and summarize:

# LOCAL R

#intake files
half_2019_intake <- readxl::read_xlsx("example_data/metadata/STHP Intake form Spread sheet 2019.xlsx", sheet = 1)
fall_intake <- readxl::read_xlsx("example_data/metadata/Rogue Adult Summer and Winter (06 11 20).xlsx", sheet = 1)

#merge intakes
# first clean them up a bit to make merging easier
half_2019_intake <- half_2019_intake[,c(1,2)]
half_2019_intake$run <- "halfpounder"
half_2019_intake$year <- "2019"
colnames(half_2019_intake) <- c("ID", "Date", "run", "year")

fall_intake <- subset(fall_intake, Run == "Summer")
fall_intake <- fall_intake[,c(1,3,6)]
colnames(fall_intake) <- c("ID", "Date", "run")
fall_intake$run <- "fall"
fall_intake$year <- "2019"

meta_data <- bind_rows( half_2019_intake, fall_intake)

kable(meta_data %>%
  group_by(run, year) %>%
  tally())
run year n
fall 2019 166
halfpounder 2019 331

Sequencing Data Summary

The data came demultiplexed from the sequencing center. The location is given in the code chunk below.

# SERVER

/dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux - 2019 fall and 2019 halfpounder

Clusters: 409,149,535
Yield (mbase): 61,782
% >Q30 bases: 87.13
Average Qual: 37.18

Ran fastqc report (not included in SOP)

Demultiplex

The first step is to demultiplex the raw sequencing file using the i5 and i7 indexes. We can skip this here because the sequencing center already performed the demux. Instead, we just copy the demuxed files and give them more reasonable names.

# SERVER

#first move the demuxed files over
cp /nfs2/hts/illumina/200723_J00107_0245_AHHLG2BBXY_1504/L23/*Omy* /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux

cp /nfs2/hts/illumina/200723_J00107_0245_AHHLG2BBXY_1504/L23/*positive* /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux

cp /nfs2/hts/illumina/200723_J00107_0245_AHHLG2BBXY_1504/L23/negative* /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux

#next rename to something easier to work with; this command drops the first 20 characters of each path and keeps the remainder

for file in ./*
do
   mv "$file" "${file:20}"
done
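The `${file:20}` in the loop above is bash substring expansion: it keeps everything from character offset 20 onward (note that `$file` includes the leading `./` from the glob, which counts toward the offset). A tiny self-contained demo with a made-up filename:

```shell
# ${var:offset} keeps everything after the first <offset> characters
prefix="ABCDEFGHIJKLMNOPQRST"           # 20 throwaway characters
file="${prefix}Omy_sample_0001.fastq"   # made-up filename
short="${file:20}"                      # drop the 20-character prefix
echo "$short"                           # -> Omy_sample_0001.fastq
```

Adjust the offset to match however many characters your sequencing center prepends.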

If, instead, you are supplied with multiplexed fastq files (boo), you can use the GT-seq scripts to demultiplex. First you need to generate a key file.

# LOCAL R

# example: Sample,PlateID,i7_name,i7_sequence,i5_name,i5_sequence
#          Sample123,P1234,i001,ACCGTA,25,CCCTAA
#          Sample321,P1234,i001,ACCGTA,26,GGCACA

#first lets get the index sequences 
index2020 <- read_tsv("example_data/metadata/index_2020.txt")
colnames(index2020) <- c("Sample","PlateID","i7_name","i7_sequence","i5_name","i5_sequence")
write_csv(index2020, "example_data/metadata/index_2020_lane.csv")

Then you run the script.

# SERVER

#note this is from Sandra's protocol so the directories are different, but it's still the same approach

#Set directory
usedir='/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/OmyRogue'

#First argument is list of barcodes and second argument is .fastq file.
#Make sure you dos2unix the barcode file if you made it in Excel or Windows!

mkdir $usedir'/sample_fastqs'
cd $usedir'/sample_fastqs'

#here is the command you will submit
python '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/pipeline/GTseq_BarcodeSplit_MPargs.py' $usedir'/OmyROGR_index-list.csv' '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RAW_READS/OmyROGROtsFERHCLAR.fq'

#here is how to wrap it correctly for submission to the server queue
SGE_Batch -q harold -r omysex -c "python '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/pipeline/GTseq_BarcodeSplit_MPargs.py' $usedir'/OmyROGR_index-list.csv' '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RAW_READS/OmyROGROtsFERHCLAR.fq'"

Genotype

Main Genotyper
Next we’ll run the GT-seq genotyper (v3.1) script on each fastq file to generate the individual genotypes (.genos). Note that there are many copies of the genotyper script; 3.1 is the current version (asymmetrical genotype caller bug fixed).

We need to run this script for each demultiplexed fastq. To speed this up we’ll use an array job. The code chunk below contains an example array job script. To run the job, you’ll save the script on the server and run it using the qsub command.

#!/bin/bash
#$ -S /bin/bash

#$ -t 1-554

#$ -tc 20

#$ -N GTseq-genotyperv3

#$ -cwd

#$ -o $JOB_NAME_$TASK_ID.out

#$ -e $JOB_NAME_$TASK_ID.err
export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/'

FASTQS=(`ls /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux/*fastq`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq.gz}.genos)

GTSEQ_GENO="/dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl"

PROBE_SEQS="/dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/Omy_GTseq390_ProbeSeqs.csv"

perl $GTSEQ_GENO $PROBE_SEQS $INFILE > $OUTFILE

#save this code chunk as a file on the server and submit it with qsub -q harold scriptname from the directory where you want the output .genos files

Some notes about this script:
-t: the total number of task IDs to use; the script iterates over this list, so if you have 100 individuals to genotype, set this to 1-100.
-tc: the throttle. The value used in the script above (20) means the 554 separate commands are submitted in batches of 20 until they are done.
submitting: I usually store all of the scripts I will submit to the server in a directory named “sge”, so to submit this script (named gtseq_geno.txt) I’d run qsub -q harold path/to/my/scripts/sge/gtseq_geno.txt. You can check progress with the qstat command.
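Rather than hand-counting files to set `-t`, you can compute the range from the demux directory. A small sketch using a throwaway directory (the directory and file names are made up; point `ls` at your real demux directory instead):

```shell
# count fastqs to get the upper bound for the -t 1-N range
demo=$(mktemp -d)                  # stand-in for your demux directory
for i in 1 2 3 4 5; do touch "$demo/sample_$i.fastq"; done
N=$(ls "$demo"/*.fastq | wc -l)
echo "array job range: -t 1-$N"
rm -r "$demo"
```

This also guards against the array script silently skipping samples when the fastq count changes between runs.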

Sex Genotyper

After the panel genotypes are written, we run the sex genotyper.

# SERVER

# the omysex script is hardcoded to require the fastqs and genos to all be in a single directory, the fastqs to be decompressed, and the script to be run from that directory
# rather than try to fix this hardcoding in the script, a shortcut is to copy all the fastqs into the directory with your .genos from the previous step, then delete them afterwards to conserve space
# another hardcoded part of the omysex script involves how it parses sample names: all fastqs must be samplename.fastq and all genos must be samplename.genos

##### decompress (note this is a script to save and submit as a job)

#################################
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-554
#$ -tc 20
#$ -N decompress
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err

FASTQS=(`ls *fastq.gz`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}

gunzip -c $INFILE > ${INFILE%.gz}

#save as script and submit this with qsub -q harold scriptname
####################################

##### copy decompressed fastqs to directory with your .genos
cp /dfs/FW_HMSC/Omalley_Lab/dayan/half_pounder/genotyping/demux/*fastq ./

##### rename
for file in ./*fastq.genos
do
   mv "$file" "${file%.fastq.genos}".genos
done

##### sex genotyper
#below is the command to run the omysex script

SGE_Batch -q harold -r omysex -c 'perl /dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/OmySEX_test_v3.pl'

##### don't forget to remove these dups when done

Compile Genotypes

After all the .genos are written, we compile them into a single output using the GenoCompile script.


#this is run from within the .genos directory
SGE_Batch -q harold -r compile -c 'perl /dfs/FW_HMSC/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > ../genotypes/half_pounder_GTs_0.1.csv'

Congratulations, you’ve run the pipeline! Now we will need to do quality control and filtering.

At this point in the suggested protocol, you would run the SummaryFigures script from the GT-seq pipeline. However, this script contains hardcoded values for genotype calling that differ from those in the actual genotyping script and will misrepresent the data. Do not run it. We will produce the same figures in the next section.

QAQC

Here we do some basic quality control. We check positive (known good DNA) and negative (known blank) controls, as well as controls with known genotypes (i.e., known winter run, etc.). After the controls, we check for concordance among sample replicates to verify that no wetlab errors occurred that might have scrambled the indexing/barcoding.

Controls

First let’s check that the controls worked well. We will check that the negative controls have far fewer reads than average (there may be some on-target reads from other samples due to index sequencing error).

# LOCAL R

# first I cleaned up the sample names in the output compiled genotype file (.csv) with regex in a text editor - separated adapter info from sample name etc, you may not need to do this depending on who did your demultiplexing

# then read this file in to R
genos_0.1 <- read_csv("example_data/genotype_data/example_GTs_0.1.csv")

# let's set a value to mark controls
# here controls contained "positive" or "negative" in their sample names, so we use simple pattern matching to create a new column

genos_0.1 <- genos_0.1 %>%
  mutate(control = ifelse(grepl("positive", Sample), "positive", ifelse(grepl("negative", Sample), "negative", "sample")))

# great let's plot
ggplot()+geom_histogram(data = genos_0.1, aes(x = `On-Target Reads`, fill= control)) + theme_classic()

Looks good. Negative controls have very few reads and positives are well distributed, but let’s double check that there isn’t a negative control with a lot of reads hiding in there, indicating a plate flip:

ggplot()+geom_histogram(data = genos_0.1[genos_0.1$control=="negative",], aes(x = `On-Target Reads`)) + theme_classic()

Uh-oh, a negative control with ~6300 on-target reads. But maybe there are just a lot of reads at this adapter/well; let’s examine it as a proportion of total reads, and also the proportion genotyped.

ggplot()+geom_histogram(data = genos_0.1, aes(x = `%On-Target`, fill= control)) + theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Okay, that looks a lot better. The negative control with a lot of reads has only a very small proportion on target, much lower than most samples and positive controls. This suggests it is unlikely to be due to a wetlab error like flipping a plate upside down. Positive and negative controls check out. Also of note: the positive controls are biased downward with respect to on-target proportion, suggesting these samples are getting old.

Replicates

Some samples were replicated. Let’s check for concordance in the genotypes, then pick the sample with the better GT success and throw out the duplicate.

#LOCAL R

# here we filter out our known controls and create our next dataset genos_0.11
genos_0.11 <- genos_0.1 %>%
  filter(control == "sample") %>%
  filter(Sample != "Winter") %>%
  filter(Sample != "Heterozygous") %>%
  filter(Sample != "Summer")

#now let's get duplicated samples
dups <- genos_0.11[genos_0.11$Sample %in% genos_0.11$Sample[duplicated(genos_0.11$Sample)],]
dups <- dups[order(dups$Sample),]

# next we'll calculate the percent concordance among replicates
# woof I don't see a good way around using a nested for loop here, maybe fix this in the future

dups_genos <- dups[,8:ncol(dups)] # grab genotypes and leave metadata out
rep_info <- matrix(ncol=ncol(dups_genos), nrow=nrow(dups_genos)/2)
colnames(rep_info) <- colnames(dups_genos)
for (j in 1:(nrow(dups_genos)/2)) {
for (i in 1:ncol(dups_genos)) {
  rep_info[j,i] <- sum(dups_genos[(j*2)-1,i]==dups_genos[(j*2),i])
}
  }

geno_concordance <- as.data.frame(as.matrix(rep_info)) %>%
  rowMeans()

rep_data <- as.data.frame(cbind(dups[c(1:length(geno_concordance))*2,2], geno_concordance))
ggplot(data=rep_data)+geom_histogram(aes(x=geno_concordance))+theme_classic()

There are ~50 replicated samples. Mean concordance of genotypes across replicates is 87.3%. Next let’s examine what’s going on with the samples with very low concordance.

#get the bad samples
bad_reps <- genos_0.11[genos_0.11$Sample %in% rep_data[rep_data$geno_concordance<0.50,1],1:7]
bad_reps[order(bad_reps$Sample),]

One member of each pair just has extremely low %On-Target reads.

Replication Summary
Replicate samples look good; the bad replicates (<50% concordance) seem to be due to one member of the pair having extremely low GT success.

Next let’s make the 0.2 dataset (i.e. remove the replicates with lower GT success).

# LOCAL R

#this writes a new dataset (0.2) by choosing the samples within duplicates and keeping the one with the highest genotyping success
genos_0.2 <- genos_0.11 %>%
  group_by(Sample) %>%
  filter(`On-Target Reads` == max(`On-Target Reads`))

Filtering

Control and replicates have been removed, now it’s time for filtering.

Filtering Summary
- IFI_cutoff=2.5
- GTperc_cutoff=90 (inds greater than 10% missing data excluded)
- Missingness (loci) > 20%
- Missingness (loci) > 10% - examine for allele correction issues
- Remove monomorphic SNPs

File readme
A key to different outputs in the filtering pipeline:
0.1: Raw, unfiltered GTs including all controls and replicates
0.2: Raw, unfiltered GTs, replicates and controls removed
0.3: filtered for IFI, GTperc (inds), and Missingness (loci) > 20%
0.4: filtered for IFI, GTperc (inds), and Missingness (loci) > 20%, and allele correction issues
1.0: 0.4 + removed monomorphic
2.0: 1.0 + duplicate sample removal
2.2: final genotype dataset with metadata (population, run, sample date)

IFI and Missingness

First we filter individuals and loci on IFI, and missingness.

Let’s take a look at the distribution of these values

ggplot(genos_0.2)+geom_histogram(aes(x=IFI))+geom_vline(aes(xintercept= 2.5), color="red")+theme_classic()

ggplot(genos_0.2)+geom_histogram(aes(x=`%GT`))+geom_vline(aes(xintercept= 90), color="red")+theme_classic()

missingness <- (colSums(genos_0.2[,8:398] == "00"))/nrow(genos_0.2)
missing <- as.data.frame(missingness)
missing$marker <- row.names(missing)
ggplot(missing) + geom_histogram(aes(x=missingness))+geom_vline(aes(xintercept= 0.2), color="red")+geom_vline(aes(xintercept= 0.1), color="blue")+theme_classic()+xlab("missingness (loci)")

Now let’s make the datasets.
0.3

# LOCAL R

#print table of bad IFI samples
kable(genos_0.2 %>%
  filter(IFI >2.5) %>%
    select(2:7), caption = "High IFI samples (low confidence barcodes)")
High IFI samples (low confidence barcodes)
Sample Raw Reads On-Target Reads %On-Target %GT IFI
OmyAC19ROGR_0138 563519 232629 41.28 88.49 3.10
OmyAC19ROGR_0142 642721 286827 44.63 91.82 2.58
OmyAC19ROGR_0143 499500 69489 13.91 88.49 2.80
OmyAC19ROGR_5052 335644 14778 4.40 72.89 2.91
OmyAC19ROGR_5507 429546 21564 5.02 79.54 3.62
# LOCAL R
#print table of bad missingness individual
kable(genos_0.2 %>%
  filter(`%GT` < 90) %>%
    select(2:7), caption = "Individuals with high missingness (>10% missing data)")
Individuals with high missingness (>10% missing data)
Sample Raw Reads On-Target Reads %On-Target %GT IFI
OmyAC19ROGR_0131 794216 247048 31.11 88.75 2.04
OmyAC19ROGR_0138 563519 232629 41.28 88.49 3.10
OmyAC19ROGR_0143 499500 69489 13.91 88.49 2.80
OmyAC19ROGR_4847 331766 22310 6.72 86.70 0.84
OmyAC19ROGR_4914 331506 4818 1.45 38.87 2.40
OmyAC19ROGR_4947 514196 13727 2.67 70.59 1.22
OmyAC19ROGR_5052 335644 14778 4.40 72.89 2.91
OmyAC19ROGR_5195 322639 12975 4.02 70.84 1.95
OmyAC19ROGR_5507 429546 21564 5.02 79.54 3.62
OmyAC19ROGR_5722 271387 2814 1.04 23.27 2.36
OmyAC19ROGR_5724 365075 16410 4.49 80.05 0.91
OmyAC19ROGR_5725 480551 13611 2.83 79.80 1.41
OmyJC19ROGR_0069 183568 22606 12.31 86.70 0.42
OmyJC19ROGR_0086 198018 12917 6.52 72.12 1.47
OmyJC19ROGR_0088 234849 22800 9.71 85.68 2.34
OmyJC19ROGR_0140 329422 24236 7.36 88.24 0.87
OmyJC19ROGR_0193 299946 17996 6.00 87.47 0.69
OmyJC19ROGR_0311 239390 9735 4.07 67.77 1.03
# LOCAL R
# collect bad markers
bad_markers <- missing[missing$missingness>0.2, 2]

#write the new dataset
genos_0.3 <- genos_0.2 %>%
  filter(IFI <2.5) %>%
  filter(`%GT` > 90) %>%
  dplyr::select(-one_of(bad_markers))

Filtering log 0.2 -> 0.3:
5 inds removed with IFI > 2.5
49 inds removed with genotyping success less than 90%
0 loci removed with > 20% missingness

0.4

Next we examine markers with moderately bad genotyping success to attempt to figure out what’s going on.

The script below will pull the allele count ratios and read counts for all individuals in the pipeline

# SERVER

#run from directory with your .genos

#collect marker info from all the genos files
touch marker_info.txt
for file in ./*genos
do
    awk -F, '{print FILENAME,$1,$2,$3,$6,$7,$8}' $file >> marker_info.txt
done

#added headers (ind, marker, a1_count, a2_count, called_geno, a1_corr, a2_corr) and cleaned up sample names with regex in a text editor to be more readable (not necessary)
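The header row can also be prepended without a text editor; a hedged sketch on a made-up one-line marker_info.txt (the column names are taken from the comment above, and the data row is fabricated for the demo):

```shell
# prepend the header to marker_info.txt without opening an editor
demo=$(mktemp -d) && cd "$demo"
echo "ind1.genos OMS00008 10 12 AT 0 0" > marker_info.txt   # made-up data row
header="ind marker a1_count a2_count called_geno a1_corr a2_corr"
{ echo "$header"; cat marker_info.txt; } > with_header.txt
mv with_header.txt marker_info.txt
head -n 1 marker_info.txt   # -> ind marker a1_count a2_count called_geno a1_corr a2_corr
```

The braces group the two commands so the header and original contents are written to the new file in one stream.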

Let’s take a look at the bad markers (10-20% missingness)

# Local R

#get marker names of markers with 0.1 < missingness < 0.2
miss0.1 <- missing[missing$missingness > 0.1,]
miss_mod <- miss0.1[miss0.1$missingness < 0.2, 2]

There are 0 markers with moderately poor genotyping success (i.e., greater than 10% but less than 20% missingness) in the example data. For this example pipeline we’re using some good markers to visualize the process instead.

Previously, these types of markers were filtered on an ad hoc basis according to the number of individuals with unclear genotypes (i.e., many individuals not falling within the clear cutoffs for a genotype call). We will attempt the same below.

Let’s look at one of these markers individually. We’ll create a plot similar to those generated by the GT-seq figures script.

Below we show plots for one representative good marker and two representative bad markers with separate issues. Then we’ll discuss.

# R LOCAL

# this is an example code chunk don't run it in your own pipeline

marker_info <- read_tsv("example_data/genotype_data/marker_info.txt")
marker_info$a1_count <- as.numeric(marker_info$a1_count)
marker_info$a2_count <- as.numeric(marker_info$a2_count)


ggplot(data=marker_info[marker_info$marker=="OMS00008",])+geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size =2)+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+xlim(c(0,150))+ylim(c(0,150))+ggtitle("Good Marker")

ggplot(data=marker_info[marker_info$marker=="Omy_RAD46672-27",])+geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size =2)+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+xlim(c(0,150))+ylim(c(0,150))+ggtitle("Bad Marker, A2 Bias")

ggplot(data=marker_info[marker_info$marker=="Omy_RAD52458-17",])+geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size =2)+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+xlim(c(0,150))+ylim(c(0,150))+ggtitle("Bad Marker\nA1 bias in a subset of samples")

The good marker here demonstrates the qualities we’re looking for. Heterozygotes are distributed around a 1:1 ratio of alternative alleles, and homozygotes have few to no reads at the alternative allele.

The first bad marker shows a typical issue, reads among putative heterozygotes are biased towards the A2 allele. This is potentially due to a paralogue at this locus and we would filter this SNP out of the final dataset (or attempt to rescue it by changing the allele correction values if we are optimizing the panel for the first few runs).

The second bad marker shows a less common problem. There is bias towards the A1 allele, but only in a subset of individuals. One explanation for this observation is that there is polymorphism among the paralogs. In any case, toss this locus.

When actually running the pipeline yourself, use the R code chunk below. This code chunk builds similar graphs for any bad loci (missingness 10-20%) and allows you to make a decision on each one.

# R Local

marker_info <- read_tsv("example_data/genotype_data/marker_info.txt")
marker_info$a1_count <- as.numeric(marker_info$a1_count)
marker_info$a2_count <- as.numeric(marker_info$a2_count)

plots <- marker_info %>%
  group_by(marker) %>%
  do(plots=ggplot(data=.)+geom_point(aes(a1_count, a2_count, color = called_geno))+theme_classic()+geom_abline(aes(slope=1, intercept=0))+geom_abline(aes(slope = 10, intercept=0), color = "green")+geom_abline(aes(slope = 0.1, intercept=0), color = "red")+geom_abline(aes(slope = 0.2, intercept=0), color = "blue")+geom_abline(aes(slope = 5, intercept=0), color = "blue")+coord_equal(ratio=1)+geom_abline(slope = -1, intercept = 10)+ggtitle(unique(.$marker)))

#plot all "bad markers"
mod_bad_plot_index <- which(plots$marker %in% miss_mod)

# loop through the plots by changing the index (here 10) until you have looked at all your questionable markers
plots$plots[[mod_bad_plot_index[10]]] #manually looped through these plots by changing the index for all 28 moderately bad markers, could make an lapply loop in the future, bad markers reported below

Remove the bad markers

# Local R
to_filt <- c("Omy_RAD46672-27", "Omy_RAD52458-17") # here list your bad marker names
genos_0.4 <- genos_0.3 %>%
  dplyr::select(-one_of(to_filt))

Monomorphic Markers and Duplicates

1.0 Monomorphic Markers

To generate the 1.0 dataset, we remove monomorphic markers

genos_1.0 <- genos_0.4 %>% 
  select_if(~ length(unique(.)) > 1)

Duplicate Samples

Some sample tissues are provided in batches of fin clips. Let’s make sure no fin clips broke apart, which would lead to a single individual being represented twice in the dataset. Rather than fussing with installing Coancestry for Windows on a unix system, we estimated relatedness using an R package (related) that implements the code from Coancestry.

To run the code in Coancestry on a windows machine, use the GUI.

We used the estimator from Lynch and Ritland (1999), not the dyadic likelihood estimator of Milligan (2003).

# Local R

# The input file needs a unique row for each individual and two columns for each diploid locus
# threw out metadata and wrote to a file
# then we split all the genotype values using regex in a text editor (after converting all na values to 0)
#   find string: \t([ATGC0XY])([ATCG0XY])  replace string: \t\1\t\2
# convert genos to numbers and remove the sex marker
# convert bases to integers (e.g., T -> 1, G -> 2)

just_genos <- genos_1.0[,c(2, 8:357)]
write_tsv(just_genos, "genotype_data/just_genos.txt")

# now run coancestry
rmat <- coancestry("./genotype_data/just_genos.txt", dyadml = 1)
rmat2 <- coancestry("./genotype_data/just_genos.txt", lynchrd  = 1)

# save the relevant info so we don't have to run this over and over and take up a ton of diskspace
rmat_to_save <- rmat2$relatedness[rmat2$relatedness$lynchrd > 0.5,]
save(rmat_to_save, file="genotype_data/relatedness.Rdata")

Check for highly related individuals and remove any with relatedness >= 0.95 from the dataset.

# LOCAL R

#Check for relatedness
load(file = "genotype_data/relatedness.Rdata")
#ggplot(rmat_to_save$relatedness)+geom_histogram(aes(x=lynchrd))+theme_classic()
rmat_to_save[which(rmat_to_save$lynchrd >=0.95), c(1:3)]

dup_inds <- rmat_to_save[which(rmat_to_save$lynchrd >= 0.95), c(1:3)]

#if you used the coancestry GUI, you can just create a vector here manually like below
#dup_inds <- c("duplicate sample name 1", "duplicate sample name 2", etc)

genos_2.0 <- genos_1.0 %>%
  filter(!(Sample %in% dup_inds$ind2.id))

File Conversion and Stats

The final step of genotyping is to collect some stats about the genotype dataset and reformat the genotype file into common formats for import into other programs.

Stats

Here are some summary stats and figures from your filtered dataset

# LOCAL R

ggplot(genos_2.0)+geom_density(aes(x=`On-Target Reads`))+geom_vline(aes(xintercept=median(`On-Target Reads`)), color = "red") +theme_classic()
On Target Read Distribution

#LOCAL R
ggplot(genos_2.0)+geom_density(aes(x=`%On-Target`))+geom_vline(aes(xintercept=median(`%On-Target`)), color = "red") +theme_classic()
Proportion on Target

Depths

#LOCAL R

#code to estimate depth at filtered loci
marker_info %>%
  filter(marker %in% colnames(genos_2.0)) %>%
  mutate(sumdepth=a1_count+a2_count) %>%
  summarise(mean=mean(sumdepth, na.rm = TRUE), median=median(sumdepth, na.rm = TRUE), sd=sd(sumdepth, na.rm = TRUE))
marker_info %>%
  filter(marker %in% colnames(genos_2.0)) %>%
  mutate(sumdepth=a1_count+a2_count) %>%
  ggplot + aes(x=sumdepth)+geom_histogram()+theme_classic()+xlab("Mean Depth Per Locus Per Individual")

Conversion

Let’s get some usable file formats

Here’s how to create adegenet’s genind object:

#LOCAL R

# Convert to genind for import into adegenet

#first get a matrix to work on

#first change column names so they don't include a dot
genos_2.1 <- genos_2.0
colnames(genos_2.1) <- gsub("\\.", "_", colnames(genos_2.1))
#convert to matrix with inds as row names
genos_2.1 <- as.matrix(genos_2.1[,c(8:362)])
row.names(genos_2.1) <- genos_2.0$Sample
genind_1.0 <- df2genind(genos_2.1, sep ="", ploidy=2,NA.char = "0")

#add in the populations
genos_2.2 <- genos_2.0 %>%
  left_join(meta_data, by=c("Sample" = "ID"))

genind_1.0@pop <- as.factor(genos_2.2$run)

Here’s a more general approach using the radiator package

# LOCAL R

# note didn't do this yet, but check out the command: 
radiator::genomic_converter()

Finally, save your files as R objects for further analysis.

# LOCAL R

# here we save a few objects with useful info
genind_2.0 <- genind_1.0
save(genos_2.2, file ="./genotype_data/genotypes_2.2.R")
save(genind_2.0, file= "./genotype_data/genind_2.0.R")